Partial zarr download/upload (stage: Design + Implementation)#1816
yarikoptic wants to merge 6 commits
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1816 +/- ##
==========================================
+ Coverage 75.12% 75.57% +0.44%
==========================================
Files 84 86 +2
Lines 11930 12230 +300
==========================================
+ Hits 8963 9243 +280
- Misses 2967 2987 +20
Covers five areas:
- --zarr TYPE:PATTERN filtering for download (glob, path, regex)
- URL parsing with zarr boundary detection (AssetZarrEntryURL)
- --zarr-mode {full, patch} for upload
- Checksums and manifests (per-directory checksums are NOT
persisted on the archive; legacy .checksum files exist on S3
under zarr-checksums/ for ~72% of older zarrs but are orphaned)
- dandi ls for zarr contents
Related: #1462, #1474
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --zarr CLI option for download to filter entries within zarr assets (glob/path/regex patterns with predefined 'metadata' alias), and --zarr-mode option for upload to support 'patch' mode (upload/update without deleting remote-only files).
Key changes:
- New dandi/zarr_filter.py: filter parsing, matching, and aliases
- URL parsing: AssetZarrEntryURL for URLs pointing into zarr assets
- Download pipeline: thread zarr_entry_filter through Downloader and
  _download_zarr, skip deletion and checksum when filter active
- Upload pipeline: zarr_mode='patch' skips remote file deletion and
  client-side checksum verification
- dandi ls: list zarr entries when URL points into a zarr
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
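To illustrate what "URLs pointing into zarr assets" involves, here is a minimal Python sketch of zarr boundary detection. It is not the code from `dandi/dandiarchive.py`: the signature, the return shape, and the `ZARR_SUFFIXES` tuple are assumptions made for illustration, and the boundary is assumed to be the first path component ending in `.zarr` or `.ngff`.

```python
from __future__ import annotations

# Assumption: a path component ending in one of these suffixes is a zarr asset.
ZARR_SUFFIXES = (".zarr", ".ngff")


def split_zarr_location(path: str) -> tuple[str, str | None]:
    """Split ``path`` into (asset path, path inside the zarr or None)."""
    parts = path.strip("/").split("/")
    for i, part in enumerate(parts):
        if part.endswith(ZARR_SUFFIXES):
            asset = "/".join(parts[: i + 1])
            inner = "/".join(parts[i + 1 :])
            # An empty remainder means the URL points at the zarr itself.
            return asset, inner or None
    # No zarr component found: the whole path is an ordinary asset path.
    return path, None
```

Under these assumptions, a path such as `sub-01/file.ome.zarr/0/0/0` splits into the asset `sub-01/file.ome.zarr` and the entry `0/0/0`, while a path with no zarr component is returned unchanged with `None`.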
- Pre-compile regex patterns at ZarrFilter construction time, catching
invalid patterns early instead of on every matches() call (B1)
- Remove redundant get_asset_download_path override in AssetZarrEntryURL
that was identical to the inherited SingleAssetURL method (H1)
- Use Literal["full", "patch"] for zarr_mode parameter instead of bare
str to prevent silent misbehavior on invalid values (H2)
- Collapse consecutive ** glob segments to avoid exponential
backtracking in _glob_match_parts (H3)
- Simplify split_zarr_location to use str.split instead of
PurePosixPath (M1)
- Add explanatory comment for type: ignore in parse_zarr_filter (M2)
- Yield {"status": "done"} when zarr filter matches zero entries (M4)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
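The `**` collapse mentioned above can be sketched as follows. This is an illustrative Python reimplementation, not the PR's `_glob_match_parts`: the function names and the use of `fnmatch.fnmatchcase` for single segments are assumptions. The point is that consecutive `**` segments are equivalent to one, so collapsing them first removes the redundant branching in the recursive matcher.

```python
from __future__ import annotations

from fnmatch import fnmatchcase


def collapse_double_stars(parts: list[str]) -> list[str]:
    """Drop any ``**`` segment that directly follows another ``**``."""
    out: list[str] = []
    for part in parts:
        if part == "**" and out and out[-1] == "**":
            continue
        out.append(part)
    return out


def glob_match_parts(pattern: list[str], path: list[str]) -> bool:
    """Match path segments against glob segments; ``**`` spans zero or more segments."""
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # Let ``**`` absorb 0, 1, 2, ... leading path segments.
        return any(glob_match_parts(rest, path[i:]) for i in range(len(path) + 1))
    return bool(path) and fnmatchcase(path[0], head) and glob_match_parts(rest, path[1:])


def glob_match(pattern: str, path: str) -> bool:
    """Match a ``/``-separated entry path against a ``/``-separated glob."""
    return glob_match_parts(collapse_double_stars(pattern.split("/")), path.split("/"))
```

With this sketch, `glob_match("**/**/**/zarr.json", "a/zarr.json")` is evaluated as if the pattern were `**/zarr.json`.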
click.Choice returns str at runtime, but upload() expects Literal["full", "patch"]. Add typing cast at the call site.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follow existing codebase pattern (UploadExisting, DownloadExisting, etc.) to define ZarrMode as a str Enum in upload.py. Eliminates the duplicated Literal["full", "patch"] across three files and the cast() workaround in cmd_upload.py. Uses TYPE_CHECKING guard in files/zarr.py to avoid circular import (files/zarr.py -> upload.py -> .files).
Co-Authored-By: Claude Code 2.1.63 / Claude Opus 4.6 <noreply@anthropic.com>
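For readers unfamiliar with the str-Enum pattern this commit refers to, a minimal Python sketch follows. The member names, comments, and `__str__` override are assumptions for illustration; only the class name `ZarrMode` and the values `full` and `patch` come from the PR.

```python
from enum import Enum


class ZarrMode(str, Enum):
    """How a local zarr is reconciled with its remote copy on upload."""

    FULL = "full"  # mirror local state, deleting remote-only entries
    PATCH = "patch"  # upload new or changed entries, never delete

    def __str__(self) -> str:
        return self.value


# A plain string (for example, from a CLI option) converts by value;
# an unknown value raises ValueError instead of being silently accepted.
mode = ZarrMode("patch")
```

Because the enum subclasses `str`, members compare equal to their string values, which is what lets one definition replace a `Literal["full", "patch"]` annotation repeated across modules.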
## Primary Use Case

From @kabilar (#1462):

```bash
# 1. Download just the metadata files from a zarr
dandi download --zarr glob:'**/.z*' --zarr glob:'**/zarr.json' \
    dandi://dandi/001289/rawdata/.../PC.ome.zarr

# 2. Edit .zattrs locally
vim PC.ome.zarr/.zattrs

# 3. Upload changes without deleting remote data chunks
dandi upload --zarr-mode patch
```
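The `--zarr` values in the example above follow a `TYPE:PATTERN` shape. As a rough illustration of how such a value might be split and validated, here is a minimal Python sketch. It is not the code in `dandi/zarr_filter.py`: `ZarrFilterSpec`, `parse_filter`, and `FILTER_TYPES` are hypothetical names, and alias handling (such as `metadata`) is left out.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

FILTER_TYPES = ("glob", "path", "regex")


@dataclass(frozen=True)
class ZarrFilterSpec:
    kind: str
    pattern: str
    # Compiled once at construction so an invalid regex fails early.
    compiled: re.Pattern[str] | None = field(default=None, compare=False)


def parse_filter(spec: str) -> ZarrFilterSpec:
    """Parse a ``TYPE:PATTERN`` string; only the first colon separates the two."""
    kind, sep, pattern = spec.partition(":")
    if not sep or kind not in FILTER_TYPES or not pattern:
        raise ValueError(
            f"expected TYPE:PATTERN with TYPE in {FILTER_TYPES}, got {spec!r}"
        )
    compiled = None
    if kind == "regex":
        try:
            compiled = re.compile(pattern)
        except re.error as exc:
            raise ValueError(f"invalid regex {pattern!r}: {exc}") from exc
    return ZarrFilterSpec(kind, pattern, compiled)
```

Splitting on the first colon only keeps patterns that themselves contain colons intact.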
Thanks @yarikoptic. An additional use case would be adding shards to an existing Zarr. For example, we have 355+ TB from a subject that we are stitching together into a single Zarr. We don't have enough space locally, so we will need to upload a shard, delete the local copy, and repeat the process.
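The shard-by-shard workflow described here depends on patch mode never treating remote-only entries as stale. A minimal Python sketch of that difference between the two modes is below. The function `plan_zarr_upload` and its digest-map inputs are hypothetical and are not how `iter_upload()` is structured; the sketch only shows the intended semantics.

```python
from __future__ import annotations


def plan_zarr_upload(
    local: dict[str, str], remote: dict[str, str], mode: str
) -> tuple[list[str], list[str]]:
    """Return (entries to upload, entries to delete remotely).

    ``local`` and ``remote`` map entry paths to content digests.
    """
    if mode not in ("full", "patch"):
        raise ValueError(f"unknown zarr mode: {mode!r}")
    # Upload anything that is new locally or whose digest differs.
    to_upload = sorted(p for p, digest in local.items() if remote.get(p) != digest)
    # Only full mode deletes entries that exist remotely but not locally.
    to_delete = sorted(set(remote) - set(local)) if mode == "full" else []
    return to_upload, to_delete
```

In the shard use case, `local` holds only the shard currently on disk, so a full-mode plan would schedule every previously uploaded shard for deletion, while a patch-mode plan leaves them alone.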
See separate implementation task. Summary: allow `dandi ls` to list files
within a Zarr asset when given a Zarr URL, using `asset.iterfiles(prefix=...)`.
Reuses `AssetZarrEntryURL` from Part 2.
Suggested change:
This will be implemented as a separate pull request. Summary: allow `dandi ls` to list files
within a Zarr asset when given a Zarr URL, using `asset.iterfiles(prefix=...)`.
Reuses `AssetZarrEntryURL` from Part 2.
1. **`--zarr` AND vs OR**: Multiple `--zarr` options use OR. Should we support
   AND composition (e.g., `--zarr path:0/1 --zarr-and glob:'**/.zattrs'`)?
   For now, OR suffices; AND can be added later.
Just having the OR operator is fine with me.
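As an illustration of the OR composition agreed on here, a minimal Python sketch: each filter becomes a predicate on an entry path, and an entry is kept if any predicate accepts it. The helpers `any_of` and `path_filter` are hypothetical names, and treating `path:` as "the entry itself or anything beneath it" is an assumption, not a statement about the PR's implementation.

```python
from typing import Callable, Iterable

EntryPredicate = Callable[[str], bool]


def any_of(predicates: Iterable[EntryPredicate]) -> EntryPredicate:
    """Combine predicates so an entry is kept if at least one accepts it."""
    preds = list(predicates)

    def keep(entry_path: str) -> bool:
        return any(pred(entry_path) for pred in preds)

    return keep


def path_filter(prefix: str) -> EntryPredicate:
    """Accept the entry at ``prefix`` and everything beneath it (assumed semantics)."""
    prefix = prefix.strip("/")
    return lambda p: p == prefix or p.startswith(prefix + "/")


# Roughly what two --zarr options would amount to under OR semantics.
keep = any_of([path_filter("0/1"), lambda p: p.rsplit("/", 1)[-1] == ".zattrs"])
```

An AND variant would swap `any` for `all`, which is why it can be added later without changing the individual filters.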
3. **Interaction with `--sync`**: When `dandi download --zarr ... --sync` is
   used, should `--sync` only delete local zarr entries that match the filter
   but aren't remote? Or should `--sync` be disallowed with `--zarr`?
   Recommendation: disallow `--sync` with `--zarr` initially, as the semantics
   are ambiguous.
As mentioned, since the semantics could be confusing, I would suggest disallowing --sync with --zarr for dandi download.
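Disallowing the combination could be as simple as an early option check before any download work starts. The sketch below is a hypothetical Python helper, not the PR's code; the function name, parameter names, and error message are assumptions.

```python
from typing import Sequence


def check_download_options(sync: bool, zarr_filters: Sequence[str]) -> None:
    """Reject --sync together with --zarr, since the deletion semantics are ambiguous."""
    if sync and zarr_filters:
        raise ValueError(
            "--sync cannot be combined with --zarr: it is unclear which local "
            "zarr entries should be deleted when only a subset is downloaded"
        )
```

Failing before the download begins means no local files are touched when the options conflict.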
2. **Server-side glob for zarr entries**: Currently the server only supports
   `prefix` filtering on zarr entries (not glob). Glob filtering therefore
   happens client-side after fetching entries. For large zarrs with millions of
   entries, this could be slow. A future server API enhancement for
   entry-level glob would help. For now, `path:` filters should be preferred
   for large zarrs to minimize data transfer.
This works. We are using Zarr v3 so glob filtering on the client-side shouldn't be an issue.
| `dandi/files/zarr.py` | `iter_upload()` patch mode support |
| `dandi/cli/cmd_ls.py` | Zarr contents listing (separate PR) |

## Open Questions
I agree with all of the proposed solutions for the open questions.
Thanks @yarikoptic. This design looks good and works for our two use cases: 1. updating Zarr metadata, and 2. adding chunks/shards to an existing Zarr.
Summary
Design and implementation for partial zarr download and upload support, addressing #1462, #1474, and related archive issues.
The PR covers five areas:
- `--zarr TYPE:PATTERN` filtering for `dandi download`: glob, path, and regex filters for selecting entries within zarr assets, with a `metadata` alias for common zarr metadata files
- URL parsing with zarr boundary detection: `AssetZarrEntryURL` to handle URLs like `dandi://dandi/000108/.../file.ome.zarr/0/0/0`
- `--zarr-mode {full, patch}` for `dandi upload`: patch mode uploads changed files without deleting remote files absent locally
- Checksums and manifests: per-directory checksums are computed by the `zarr_checksum` library but are NOT persisted (only the root digest is stored in the DB); legacy `.checksum` files exist on S3 at `zarr-checksums/` for ~72% of older zarrs but have been orphaned since Dec 2022
- `dandi ls` for zarr contents: listing files within a zarr asset when the URL points into a zarr

Key findings from investigation
- The `zarr_checksum` algorithm IS hierarchical (Merkle tree, bottom-up via `ZarrChecksumTree`)
- The `ingest_zarr_archive` task computes checksums entirely in memory and stores only the root digest
- `.checksum` files on S3 (`zarr-checksums/` prefix) were written by `ZarrChecksumFileUpdater`, removed in dandi-archive PRs #1390/#1395 (Dec 2022). Legacy files remain for older zarrs but no API exposes them

Implementation
New files
- `dandi/zarr_filter.py`: `ZarrFilter` dataclass with glob/path/regex matching, `parse_zarr_filter()`, `make_zarr_entry_filter()`, `ZARR_FILTER_ALIASES` (includes `metadata` alias)
- `dandi/tests/test_zarr_filter.py`: 52 unit tests covering all filter types, parsing, aliases, edge cases, invalid regex validation

Modified files
- `dandi/dandiarchive.py`: `split_zarr_location()`, `AssetZarrEntryURL` class, updated `parse_dandi_url()` to detect zarr boundaries
- `dandi/download.py`: `zarr_filters` parameter on `download()`, filter threading through `Downloader` and `_download_zarr()`, skip deletion/checksum when filter active
- `dandi/files/zarr.py`: `zarr_mode: Literal["full", "patch"]` on `iter_upload()`, patch mode skips remote file deletion and client-side checksum verification
- `dandi/upload.py`: `zarr_mode` parameter on `upload()`, conditional pass-through for `ZarrAsset`
- `dandi/cli/cmd_download.py`: `--zarr` click option (multiple, OR logic)
- `dandi/cli/cmd_upload.py`: `--zarr-mode` click option
- `dandi/cli/cmd_ls.py`: list zarr entries when URL is `AssetZarrEntryURL`
- `dandi/cli/tests/test_download.py`: updated mock expectations for new `zarr_filters` parameter

Integration tests
- `dandi/tests/test_download.py`: 6 tests (glob filter, metadata alias, no-delete, path filter, nonexistent filter, sync conflict)
- `dandi/tests/test_upload.py`: 3 tests (patch no-delete, full delete, patch updates)
- `dandi/tests/test_dandiarchive.py`: 3 URL parsing cases + 8 `split_zarr_location` cases

Review checklist
Please review the design at `doc/design/partial-zarr.md` and comment on:
- `--zarr TYPE:PATTERN` syntax: is the filter approach right? Are glob/path/regex the right types?
- `metadata` alias expansion: does `glob:**/.z*` + `glob:**/zarr.json` + `glob:**/.zmetadata` cover all cases?
- `--zarr-mode patch` semantics: is "upload without deleting" the right default for patch? Should subtree cleanup happen?
- Is `AssetZarrEntryURL` with zarr boundary detection the right approach?
- `--sync` conflict raises error, server-side glob deferred
- Should `zarr-checksums/` files on S3 be cleaned up as part of this or separately?

TODO
- `dandi/zarr_filter.py`: filter parsing and matching
- `AssetZarrEntryURL` and `split_zarr_location()` in `dandi/dandiarchive.py`
- `--zarr` option to `dandi download` CLI
- `_download_zarr()` for partial download support
- `--zarr-mode` option to `dandi upload` CLI
- `iter_upload()` (`dandi/files/zarr.py`)
- `zarr_mode` through `dandi/upload.py`
- `dandi ls` zarr contents support
- `zarr_filter.py` (52 tests)
- `split_zarr_location` (11 test cases)
- `Literal` typing, `**` collapse, redundant override removal

🤖 Generated with Claude Code